Vladivostok
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Ye, Xi, Yin, Fangcong, He, Yinghui, Zhang, Joie, Yen, Howard, Gao, Tianyu, Durrett, Greg, Chen, Danqi
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc
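The abstract notes that deterministic procedures and structured outputs make rule-based evaluation possible, but it does not say which metric is used. As a minimal sketch, assuming a row-level F1 between a predicted TSV and a reference TSV (the function names `parse_tsv` and `row_f1` are hypothetical and not taken from the LongProc code), such a check could look like this:

```python
def parse_tsv(text):
    """Parse TSV text into a set of rows; each row is a tuple of stripped cells."""
    rows = set()
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            rows.add(tuple(cell.strip() for cell in line.split("\t")))
    return rows


def row_f1(predicted, reference):
    """Row-level F1: a predicted row counts only if it exactly matches a reference row.

    Assumption: row order and duplicate rows are ignored.
    """
    pred_rows = parse_tsv(predicted)
    ref_rows = parse_tsv(reference)
    matched = len(pred_rows & ref_rows)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_rows)
    recall = matched / len(ref_rows)
    return 2 * precision * recall / (precision + recall)
```

Because the comparison is exact and order-insensitive, it needs no model-based judge, which is the property the abstract highlights; a real benchmark may normalize cells or weight rows differently.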
North Korean troops in Ukraine 'fair game', US warns Russia as war rages on
United States defence secretary Lloyd Austin has weighed in on reports that North Korea was preparing to enter the Ukraine war with troops. "If they are co-belligerents, if their intention is to participate in this war on Russia's behalf, that is a very, very serious issue," Austin said. Austin was returning from his fourth visit to Kyiv, where he announced a $400m package of US weapons for Ukraine. John Kirby, White House national security spokesman, said Washington believes that at least 3,000 North Korean soldiers arrived this month by sea to Vladivostok, Russia's largest Pacific port. "These soldiers then travelled onward to multiple Russian military training sites in eastern Russia, where they are currently undergoing training," Kirby said on Wednesday.
A Russian Jeopardy! Data Set for Question-Answering Systems
Question answering (QA) is one of the most common NLP tasks and is related to named entity recognition, fact extraction, semantic search and some other fields. In industry, it is much appreciated in chatbots and corporate information systems. It is also a challenging task that attracted the attention of a very general audience through the quiz show Jeopardy! In this article we describe a Jeopardy!-like Russian QA data set collected from the official Russian quiz database Chgk (che ge ka). The data set includes 379,284 quiz-like questions, 29,375 of which come from the Russian analogue of Jeopardy!, "Own Game". We examine its linguistic features and the related QA task. We conclude by discussing the prospects for a QA competition based on the data set collected from this database.
Space warfare: US, China, and Russia are gearing up for the next frontier of armed conflict
The next big war may be fought in space. As the Pentagon is gearing up for a future celestial conflict, so are our chief adversaries, China and Russia. Here's why "Star Wars" is no longer merely a topic of science fiction. The best way to avoid space warfare is to be ready for it. On Dec. 28, Elon Musk's SpaceX launched into space the Pentagon's highly secretive X-37B Orbital Test Vehicle, an unmanned reusable robotic spacecraft operated by the Air Force, in collaboration with Space Force.
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Roberts, Jonathan, Lüddecke, Timo, Sheikh, Rehan, Han, Kai, Albanie, Samuel
Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.
English to Arabic machine translation of mathematical documents
Eddahibi, Mustapha, Mensouri, Mohammed
This paper presents a machine translation system tailored specifically to LaTeX mathematical documents. The system translates English LaTeX mathematical documents into Arabic LaTeX, catering to the growing demand for multilingual accessibility in scientific and mathematical literature. With the vast proliferation of LaTeX mathematical documents, the need for an efficient and accurate translation system has become increasingly pressing. This paper addresses the necessity for a robust translation tool that enables seamless communication and comprehension of complex mathematical content across language barriers. The proposed system uses a Transformer model as the core of the translation system, aiming for enhanced accuracy and fluency in the translated Arabic LaTeX documents. Furthermore, the integration of RyDArab, an Arabic mathematical TeX extension, along with a rule-based translator for Arabic mathematical expressions, contributes to the precise rendering of complex mathematical symbols and equations in the translated output. The paper discusses the architecture and methodology of the developed system, highlighting its efficacy in bridging the language gap in the domain of mathematical documentation.
Multilingual Event Linking to Wikidata
Pratapa, Adithya, Gupta, Rishubh, Mitamura, Teruko
We present a task of multilingual linking of events to a knowledge base. We automatically compile a large-scale dataset for this task, comprising 1.8M mentions across 44 languages referring to over 10.9K events from Wikidata. We propose two variants of the event linking task: 1) multilingual, where event descriptions are from the same language as the mention, and 2) crosslingual, where all event descriptions are in English. On the two proposed tasks, we compare multiple event linking systems including BM25+ (Lv and Zhai, 2011) and multilingual adaptations of the biencoder and crossencoder architectures from BLINK (Wu et al., 2020). In our experiments on the two task variants, we find both biencoder and crossencoder models significantly outperform the BM25+ baseline. Our results also indicate that the crosslingual task is in general more challenging than the multilingual task. To test the out-of-domain generalization of the proposed linking systems, we additionally create a Wikinews-based evaluation set. We present qualitative analysis highlighting various aspects captured by the proposed dataset, including the need for temporal reasoning over context and tackling diverse event descriptions across languages.
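To make the BM25+ baseline concrete, the following is a minimal sketch of BM25+ scoring used to rank candidate event descriptions against a mention context. It is an illustration rather than the authors' implementation: the whitespace tokenizer, the parameter defaults (`k1`, `b`, `delta`), the function name `bm25_plus_scores`, and the example descriptions are all assumptions, and the query term frequency component is omitted for brevity.

```python
import math
from collections import Counter


def bm25_plus_scores(query, docs, k1=1.2, b=0.75, delta=1.0):
    """Score each document against the query with a simplified BM25+.

    Assumptions: lowercase whitespace tokenization; each distinct query
    term is counted once (no query term frequency weighting).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    query_terms = set(query.lower().split())
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / doc_freq[term])
            # delta lower-bounds the contribution of any matched term,
            # so long documents are not over-penalized.
            score += idf * ((k1 + 1) * tf / (length_norm + tf) + delta)
        scores.append(score)
    return scores


# Hypothetical event descriptions, for illustration only.
descriptions = [
    "2014 winter olympics held in sochi russia",
    "2018 fifa world cup held in russia",
    "eurovision song contest 2009 in moscow",
]
scores = bm25_plus_scores("winter olympics sochi", descriptions)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setup the mention context is linked to the highest-scoring description. Lexical matching of this kind cannot bridge languages, which is consistent with the abstract's finding that the neural biencoder and crossencoder models outperform it.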
Russian tankers going dark raises flags on sanctions evasion
Russian tankers carrying oil chemicals and oil products are increasingly concealing their movements, a phenomenon that some maritime experts warn could signal attempts to evade unprecedented sanctions prompted by the invasion of Ukraine. In the week ending March 25, there were at least 33 occurrences of so-called "dark activity" -- operating while onboard systems to transmit their locations are turned off -- by Russian tankers, said Windward Ltd., an Israeli consultancy that specializes in maritime risk using artificial intelligence and satellite imagery. That's more than double the weekly average of 14 in the past year. The dark operations occurred mainly in or around Russia's exclusive economic zone, according to Windward, which conducted the research at Bloomberg's request. The ships engaging in dark activity include vessels connected to big corporations and multinational shipping firms, as well as small businesses, according to Windward.